1. Import required libraries and read the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df=pd.read_csv("Apps_data.csv")
df
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
| 10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up |
| 10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |
10841 rows × 13 columns
2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features.
df.head(10)
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| 5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up |
| 6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up |
| 7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 | 6.1.61.1 | 4.2 and up |
| 8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up |
| 9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up |
df.shape
(10841, 13)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   App             10841 non-null  object
 1   Category        10841 non-null  object
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object
 4   Size            10841 non-null  object
 5   Installs        10841 non-null  object
 6   Type            10840 non-null  object
 7   Price           10841 non-null  object
 8   Content Rating  10840 non-null  object
 9   Genres          10841 non-null  object
 10  Last Updated    10841 non-null  object
 11  Current Ver     10833 non-null  object
 12  Android Ver     10838 non-null  object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
3. Check summary statistics of the dataset. List out the columns that need to be worked upon for model building.
df.describe()
| | Rating |
|---|---|
| count | 9367.000000 |
| mean | 4.193338 |
| std | 0.537431 |
| min | 1.000000 |
| 25% | 4.000000 |
| 50% | 4.300000 |
| 75% | 4.500000 |
| max | 19.000000 |
df.describe(include='object')
| | App | Category | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10841 | 10841 | 10841 | 10841 | 10841 | 10840 | 10841 | 10840 | 10841 | 10841 | 10833 | 10838 |
| unique | 9660 | 34 | 6002 | 462 | 22 | 3 | 93 | 6 | 120 | 1378 | 2832 | 33 |
| top | ROBLOX | FAMILY | 0 | Varies with device | 1,000,000+ | Free | 0 | Everyone | Tools | August 3, 2018 | Varies with device | 4.1 and up |
| freq | 9 | 1972 | 596 | 1695 | 1579 | 10039 | 10040 | 8714 | 842 | 326 | 1459 | 2451 |
mc=df.loc[:, df.columns != 'Rating'].columns
print('The columns that need to be worked upon for model building are')
mc
The columns that need to be worked upon for model building are
Index(['App', 'Category', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'],
dtype='object')
4. Check whether there are any duplicate records in the dataset. If there are, drop them.
df.duplicated().sum()
483
df.drop_duplicates(inplace=True)
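As a small illustration of what the two calls above do, here is a sketch on a toy frame. The frame `toy` and its values are made up for the example and are not taken from Apps_data.csv.

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data)
toy = pd.DataFrame({
    "App": ["A", "B", "A", "C"],
    "Rating": [4.1, 3.9, 4.1, 4.5],
})

# duplicated() flags rows that are identical to an earlier row
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()

print(n_dupes)
print(deduped.index.tolist())
```

The index labels of dropped rows are not reused, so later displays can show gaps in the index until `reset_index` is called.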
5. Check the unique categories of the column 'Category'. Is there any invalid category? If yes, drop it.
df['Category'].unique()
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
'1.9'], dtype=object)
df[df['Category']=='1.9']
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10472 | Life Made WI-Fi Touchscreen Photo Frame | 1.9 | 19.0 | 3.0M | 1,000+ | Free | 0 | Everyone | NaN | February 11, 2018 | 1.0.19 | 4.0 and up | NaN |
df.drop(index=10472,inplace=True)
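The invalid value '1.9' was spotted by eye in the list of unique categories. One possible way to flag such values programmatically is to check each label against the pattern the valid ones follow (upper-case letters and underscores). The pattern and the short list below are assumptions made for this sketch.

```python
import re

# A few category labels from the output above, including the invalid one
categories = ["ART_AND_DESIGN", "FAMILY", "1.9", "TOOLS"]

# Assumption: valid categories consist only of upper-case letters and underscores
valid_pattern = re.compile(r"[A-Z_]+")

invalid = [c for c in categories if not valid_pattern.fullmatch(c)]
print(invalid)
```

Dropping by a condition such as this, rather than by the hard-coded index 10472, would keep working if the row order of the file changed.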
6. Check whether there are missing values in the column 'Rating'. If there are, drop them and create a new column 'Rating_category' by converting ratings to high and low categories (> 3.5 is high, the rest low).
df['Rating'].isna().sum()
1465
df.isnull().sum()
App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64
df.dropna(subset='Rating',inplace=True)
df['Rating']
0 4.1
1 3.9
2 4.7
3 4.5
4 4.3
...
10834 4.0
10836 4.5
10837 5.0
10839 4.5
10840 4.5
Name: Rating, Length: 8892, dtype: float64
df=df[df['Category'].notna()].copy()
df
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up |
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up |
| 10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device |
8892 rows × 13 columns
df['Rating_category']=df['Rating'].apply(lambda x:'High' if x > 3.5 else 'Low')
df['Rating_category']
0 High
1 High
2 High
3 High
4 High
...
10834 High
10836 High
10837 High
10839 High
10840 High
Name: Rating_category, Length: 8892, dtype: object
df['Rating_category'].unique()
array(['High', 'Low'], dtype=object)
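The order of the steps above matters: any comparison with NaN is false, so a missing rating passed through the same lambda would silently be labelled 'Low'. The sketch below makes the threshold rule explicit; the function name `rating_category` is hypothetical.

```python
import math

def rating_category(rating, threshold=3.5):
    """Label a rating 'High' if it is strictly above the threshold, else 'Low'."""
    if math.isnan(rating):
        # Missing ratings should be dropped beforehand, as done above
        raise ValueError("rating is missing")
    return "High" if rating > threshold else "Low"

labels = [rating_category(r) for r in [4.1, 3.5, 3.6, 1.0]]
print(labels)

# NaN compares as false, which is why missing values need handling first
nan_is_above = float("nan") > 3.5
print(nan_is_above)
```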
7. Check the distribution of the newly created column 'Rating_category' and comment on the distribution.
df['Rating_category'].value_counts()
High    8012
Low      880
Name: Rating_category, dtype: int64
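To comment on the distribution, it helps to turn the counts above into shares. The arithmetic below uses only the two counts printed by `value_counts()`.

```python
# Counts taken from the value_counts() output above
counts = {"High": 8012, "Low": 880}

total = sum(counts.values())
shares = {label: 100 * n / total for label, n in counts.items()}
ratio = counts["High"] / counts["Low"]

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
print(f"High-to-Low ratio: {ratio:.1f}")
```

Roughly nine out of ten apps fall in the 'High' class, so the target is clearly imbalanced. A model could reach about 90% accuracy by always predicting 'High', which suggests that accuracy alone may be a weak metric here.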
8. Convert the column 'Reviews' to a numeric data type, check for outliers in the column, and handle the outliers using a transformation approach. (Hint: Use log transformation)
df['Reviews'].dtype
dtype('O')
df['Reviews'].unique()
array(['159', '967', '87510', ..., '603', '1195', '398307'], dtype=object)
df['Reviews']=df['Reviews'].astype('int')
df['Reviews'].dtype
dtype('int32')
import plotly.express as px
px.box(df['Reviews'])
df['Reviews']=np.log1p(df['Reviews'])
px.box(df['Reviews'])
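To see what `np.log1p` does to individual values, the sketch below applies the same function from the standard library, `math.log1p`, to a few review counts from the first rows of the data. `log1p(x)` computes log(1 + x), so a count of 0 maps to 0 instead of negative infinity.

```python
import math

# Review counts from the first rows of the dataset, plus 0 as an edge case
raw_reviews = [159, 967, 87510, 0]

logged = [math.log1p(n) for n in raw_reviews]

# expm1 is the inverse of log1p, so the original counts can be recovered
restored = [round(math.expm1(v)) for v in logged]

for n, v in zip(raw_reviews, logged):
    print(n, round(v, 6))
```

The transformed values for 159 and 967 agree with the 'Reviews' entries shown in the later displays of `df` (5.075174 and 6.875232).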
9. The column 'Size' contains alphanumeric values. Treat the non-numeric data and convert the column into a suitable data type. (Hint: Replace M with 1 million and k with 1 thousand, and drop the entries where Size is 'Varies with device')
df['Size'].dtype
dtype('O')
df['Size'].unique()
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
'28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
'31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M',
'11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M',
'26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M',
'5.7M', '8.6M', '2.4M', '27M', '2.5M', '7.0M', '16M', '3.4M',
'8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
'2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
'7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
'5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
'42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
'41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
'7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
'3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
'3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M',
'4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M',
'2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M',
'7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k',
'93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M',
'67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k',
'41k', '292k', '11k', '80M', '1.7M', '10.0M', '74M', '62M', '69M',
'75M', '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M',
'81M', '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M',
'266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k',
'544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k',
'318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k',
'251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k',
'239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k',
'74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k',
'45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k',
'872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k',
'210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k',
'1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k',
'478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k',
'728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k',
'683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k',
'716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k',
'951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k',
'551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k',
'597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k',
'643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k',
'656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k',
'629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k',
'309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k',
'847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k',
'24k', '892k', '154k', '18k', '33k', '860k', '364k', '387k',
'626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k',
'143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k',
'957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k',
'234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'],
dtype=object)
df['Size']=df['Size'].replace({'M':'*10**6','k':'*10**3','Varies with device':np.nan},regex=True)
df['Size'].unique()
array(['19*10**6', '14*10**6', '8.7*10**6', '25*10**6', '2.8*10**6',
'5.6*10**6', '29*10**6', '33*10**6', '3.1*10**6', '28*10**6',
'12*10**6', '20*10**6', '21*10**6', '37*10**6', '2.7*10**6',
'5.5*10**6', '17*10**6', '39*10**6', '31*10**6', '4.2*10**6',
'23*10**6', '6.0*10**6', '6.1*10**6', '4.6*10**6', '9.2*10**6',
'5.2*10**6', '11*10**6', '24*10**6', nan, '9.4*10**6', '15*10**6',
'10*10**6', '1.2*10**6', '26*10**6', '8.0*10**6', '7.9*10**6',
'56*10**6', '57*10**6', '35*10**6', '54*10**6', '201*10**3',
'3.6*10**6', '5.7*10**6', '8.6*10**6', '2.4*10**6', '27*10**6',
'2.5*10**6', '7.0*10**6', '16*10**6', '3.4*10**6', '8.9*10**6',
'3.9*10**6', '2.9*10**6', '38*10**6', '32*10**6', '5.4*10**6',
'18*10**6', '1.1*10**6', '2.2*10**6', '4.5*10**6', '9.8*10**6',
'52*10**6', '9.0*10**6', '6.7*10**6', '30*10**6', '2.6*10**6',
'7.1*10**6', '22*10**6', '6.4*10**6', '3.2*10**6', '8.2*10**6',
'4.9*10**6', '9.5*10**6', '5.0*10**6', '5.9*10**6', '13*10**6',
'73*10**6', '6.8*10**6', '3.5*10**6', '4.0*10**6', '2.3*10**6',
'2.1*10**6', '42*10**6', '9.1*10**6', '55*10**6', '23*10**3',
'7.3*10**6', '6.5*10**6', '1.5*10**6', '7.5*10**6', '51*10**6',
'41*10**6', '48*10**6', '8.5*10**6', '46*10**6', '8.3*10**6',
'4.3*10**6', '4.7*10**6', '3.3*10**6', '40*10**6', '7.8*10**6',
'8.8*10**6', '6.6*10**6', '5.1*10**6', '61*10**6', '66*10**6',
'79*10**3', '8.4*10**6', '3.7*10**6', '118*10**3', '44*10**6',
'695*10**3', '1.6*10**6', '6.2*10**6', '53*10**6', '1.4*10**6',
'3.0*10**6', '7.2*10**6', '5.8*10**6', '3.8*10**6', '9.6*10**6',
'45*10**6', '63*10**6', '49*10**6', '77*10**6', '4.4*10**6',
'70*10**6', '9.3*10**6', '8.1*10**6', '36*10**6', '6.9*10**6',
'7.4*10**6', '84*10**6', '97*10**6', '2.0*10**6', '1.9*10**6',
'1.8*10**6', '5.3*10**6', '47*10**6', '556*10**3', '526*10**3',
'76*10**6', '7.6*10**6', '59*10**6', '9.7*10**6', '78*10**6',
'72*10**6', '43*10**6', '7.7*10**6', '6.3*10**6', '334*10**3',
'93*10**6', '65*10**6', '79*10**6', '100*10**6', '58*10**6',
'50*10**6', '68*10**6', '64*10**6', '34*10**6', '67*10**6',
'60*10**6', '94*10**6', '9.9*10**6', '232*10**3', '99*10**6',
'624*10**3', '95*10**6', '8.5*10**3', '41*10**3', '292*10**3',
'11*10**3', '80*10**6', '1.7*10**6', '10.0*10**6', '74*10**6',
'62*10**6', '69*10**6', '75*10**6', '98*10**6', '85*10**6',
'82*10**6', '96*10**6', '87*10**6', '71*10**6', '86*10**6',
'91*10**6', '81*10**6', '92*10**6', '83*10**6', '88*10**6',
'704*10**3', '862*10**3', '899*10**3', '378*10**3', '4.8*10**6',
'266*10**3', '375*10**3', '1.3*10**6', '975*10**3', '980*10**3',
'4.1*10**6', '89*10**6', '696*10**3', '544*10**3', '525*10**3',
'920*10**3', '779*10**3', '853*10**3', '720*10**3', '713*10**3',
'772*10**3', '318*10**3', '58*10**3', '241*10**3', '196*10**3',
'857*10**3', '51*10**3', '953*10**3', '865*10**3', '251*10**3',
'930*10**3', '540*10**3', '313*10**3', '746*10**3', '203*10**3',
'26*10**3', '314*10**3', '239*10**3', '371*10**3', '220*10**3',
'730*10**3', '756*10**3', '91*10**3', '293*10**3', '17*10**3',
'74*10**3', '14*10**3', '317*10**3', '78*10**3', '924*10**3',
'818*10**3', '81*10**3', '939*10**3', '169*10**3', '45*10**3',
'965*10**3', '90*10**6', '545*10**3', '61*10**3', '283*10**3',
'655*10**3', '714*10**3', '93*10**3', '872*10**3', '121*10**3',
'322*10**3', '976*10**3', '206*10**3', '954*10**3', '444*10**3',
'717*10**3', '210*10**3', '609*10**3', '308*10**3', '306*10**3',
'175*10**3', '350*10**3', '383*10**3', '454*10**3', '1.0*10**6',
'70*10**3', '812*10**3', '442*10**3', '842*10**3', '417*10**3',
'412*10**3', '459*10**3', '478*10**3', '335*10**3', '782*10**3',
'721*10**3', '430*10**3', '429*10**3', '192*10**3', '460*10**3',
'728*10**3', '496*10**3', '816*10**3', '414*10**3', '506*10**3',
'887*10**3', '613*10**3', '778*10**3', '683*10**3', '592*10**3',
'186*10**3', '840*10**3', '647*10**3', '373*10**3', '437*10**3',
'598*10**3', '716*10**3', '585*10**3', '982*10**3', '219*10**3',
'55*10**3', '323*10**3', '691*10**3', '511*10**3', '951*10**3',
'963*10**3', '25*10**3', '554*10**3', '351*10**3', '27*10**3',
'82*10**3', '208*10**3', '551*10**3', '29*10**3', '103*10**3',
'116*10**3', '153*10**3', '209*10**3', '499*10**3', '173*10**3',
'597*10**3', '809*10**3', '122*10**3', '411*10**3', '400*10**3',
'801*10**3', '787*10**3', '50*10**3', '643*10**3', '986*10**3',
'516*10**3', '837*10**3', '780*10**3', '20*10**3', '498*10**3',
'600*10**3', '656*10**3', '221*10**3', '228*10**3', '176*10**3',
'34*10**3', '259*10**3', '164*10**3', '458*10**3', '629*10**3',
'28*10**3', '288*10**3', '775*10**3', '785*10**3', '636*10**3',
'916*10**3', '994*10**3', '309*10**3', '485*10**3', '914*10**3',
'903*10**3', '608*10**3', '500*10**3', '54*10**3', '562*10**3',
'847*10**3', '948*10**3', '811*10**3', '270*10**3', '48*10**3',
'523*10**3', '784*10**3', '280*10**3', '24*10**3', '892*10**3',
'154*10**3', '18*10**3', '33*10**3', '860*10**3', '364*10**3',
'387*10**3', '626*10**3', '161*10**3', '879*10**3', '39*10**3',
'170*10**3', '141*10**3', '160*10**3', '144*10**3', '143*10**3',
'190*10**3', '376*10**3', '193*10**3', '473*10**3', '246*10**3',
'73*10**3', '253*10**3', '957*10**3', '420*10**3', '72*10**3',
'404*10**3', '470*10**3', '226*10**3', '240*10**3', '89*10**3',
'234*10**3', '257*10**3', '861*10**3', '467*10**3', '676*10**3',
'552*10**3', '582*10**3', '619*10**3'], dtype=object)
df['Size']=df['Size'][df['Size'].isnull()==False].map(eval)
df['Size'].unique()
array([1.90e+07, 1.40e+07, 8.70e+06, 2.50e+07, 2.80e+06, 5.60e+06,
2.90e+07, 3.30e+07, 3.10e+06, 2.80e+07, 1.20e+07, 2.00e+07,
2.10e+07, 3.70e+07, 2.70e+06, 5.50e+06, 1.70e+07, 3.90e+07,
3.10e+07, 4.20e+06, 2.30e+07, 6.00e+06, 6.10e+06, 4.60e+06,
9.20e+06, 5.20e+06, 1.10e+07, 2.40e+07, nan, 9.40e+06,
1.50e+07, 1.00e+07, 1.20e+06, 2.60e+07, 8.00e+06, 7.90e+06,
5.60e+07, 5.70e+07, 3.50e+07, 5.40e+07, 2.01e+05, 3.60e+06,
5.70e+06, 8.60e+06, 2.40e+06, 2.70e+07, 2.50e+06, 7.00e+06,
1.60e+07, 3.40e+06, 8.90e+06, 3.90e+06, 2.90e+06, 3.80e+07,
3.20e+07, 5.40e+06, 1.80e+07, 1.10e+06, 2.20e+06, 4.50e+06,
9.80e+06, 5.20e+07, 9.00e+06, 6.70e+06, 3.00e+07, 2.60e+06,
7.10e+06, 2.20e+07, 6.40e+06, 3.20e+06, 8.20e+06, 4.90e+06,
9.50e+06, 5.00e+06, 5.90e+06, 1.30e+07, 7.30e+07, 6.80e+06,
3.50e+06, 4.00e+06, 2.30e+06, 2.10e+06, 4.20e+07, 9.10e+06,
5.50e+07, 2.30e+04, 7.30e+06, 6.50e+06, 1.50e+06, 7.50e+06,
5.10e+07, 4.10e+07, 4.80e+07, 8.50e+06, 4.60e+07, 8.30e+06,
4.30e+06, 4.70e+06, 3.30e+06, 4.00e+07, 7.80e+06, 8.80e+06,
6.60e+06, 5.10e+06, 6.10e+07, 6.60e+07, 7.90e+04, 8.40e+06,
3.70e+06, 1.18e+05, 4.40e+07, 6.95e+05, 1.60e+06, 6.20e+06,
5.30e+07, 1.40e+06, 3.00e+06, 7.20e+06, 5.80e+06, 3.80e+06,
9.60e+06, 4.50e+07, 6.30e+07, 4.90e+07, 7.70e+07, 4.40e+06,
7.00e+07, 9.30e+06, 8.10e+06, 3.60e+07, 6.90e+06, 7.40e+06,
8.40e+07, 9.70e+07, 2.00e+06, 1.90e+06, 1.80e+06, 5.30e+06,
4.70e+07, 5.56e+05, 5.26e+05, 7.60e+07, 7.60e+06, 5.90e+07,
9.70e+06, 7.80e+07, 7.20e+07, 4.30e+07, 7.70e+06, 6.30e+06,
3.34e+05, 9.30e+07, 6.50e+07, 7.90e+07, 1.00e+08, 5.80e+07,
5.00e+07, 6.80e+07, 6.40e+07, 3.40e+07, 6.70e+07, 6.00e+07,
9.40e+07, 9.90e+06, 2.32e+05, 9.90e+07, 6.24e+05, 9.50e+07,
8.50e+03, 4.10e+04, 2.92e+05, 1.10e+04, 8.00e+07, 1.70e+06,
7.40e+07, 6.20e+07, 6.90e+07, 7.50e+07, 9.80e+07, 8.50e+07,
8.20e+07, 9.60e+07, 8.70e+07, 7.10e+07, 8.60e+07, 9.10e+07,
8.10e+07, 9.20e+07, 8.30e+07, 8.80e+07, 7.04e+05, 8.62e+05,
8.99e+05, 3.78e+05, 4.80e+06, 2.66e+05, 3.75e+05, 1.30e+06,
9.75e+05, 9.80e+05, 4.10e+06, 8.90e+07, 6.96e+05, 5.44e+05,
5.25e+05, 9.20e+05, 7.79e+05, 8.53e+05, 7.20e+05, 7.13e+05,
7.72e+05, 3.18e+05, 5.80e+04, 2.41e+05, 1.96e+05, 8.57e+05,
5.10e+04, 9.53e+05, 8.65e+05, 2.51e+05, 9.30e+05, 5.40e+05,
3.13e+05, 7.46e+05, 2.03e+05, 2.60e+04, 3.14e+05, 2.39e+05,
3.71e+05, 2.20e+05, 7.30e+05, 7.56e+05, 9.10e+04, 2.93e+05,
1.70e+04, 7.40e+04, 1.40e+04, 3.17e+05, 7.80e+04, 9.24e+05,
8.18e+05, 8.10e+04, 9.39e+05, 1.69e+05, 4.50e+04, 9.65e+05,
9.00e+07, 5.45e+05, 6.10e+04, 2.83e+05, 6.55e+05, 7.14e+05,
9.30e+04, 8.72e+05, 1.21e+05, 3.22e+05, 9.76e+05, 2.06e+05,
9.54e+05, 4.44e+05, 7.17e+05, 2.10e+05, 6.09e+05, 3.08e+05,
3.06e+05, 1.75e+05, 3.50e+05, 3.83e+05, 4.54e+05, 1.00e+06,
7.00e+04, 8.12e+05, 4.42e+05, 8.42e+05, 4.17e+05, 4.12e+05,
4.59e+05, 4.78e+05, 3.35e+05, 7.82e+05, 7.21e+05, 4.30e+05,
4.29e+05, 1.92e+05, 4.60e+05, 7.28e+05, 4.96e+05, 8.16e+05,
4.14e+05, 5.06e+05, 8.87e+05, 6.13e+05, 7.78e+05, 6.83e+05,
5.92e+05, 1.86e+05, 8.40e+05, 6.47e+05, 3.73e+05, 4.37e+05,
5.98e+05, 7.16e+05, 5.85e+05, 9.82e+05, 2.19e+05, 5.50e+04,
3.23e+05, 6.91e+05, 5.11e+05, 9.51e+05, 9.63e+05, 2.50e+04,
5.54e+05, 3.51e+05, 2.70e+04, 8.20e+04, 2.08e+05, 5.51e+05,
2.90e+04, 1.03e+05, 1.16e+05, 1.53e+05, 2.09e+05, 4.99e+05,
1.73e+05, 5.97e+05, 8.09e+05, 1.22e+05, 4.11e+05, 4.00e+05,
8.01e+05, 7.87e+05, 5.00e+04, 6.43e+05, 9.86e+05, 5.16e+05,
8.37e+05, 7.80e+05, 2.00e+04, 4.98e+05, 6.00e+05, 6.56e+05,
2.21e+05, 2.28e+05, 1.76e+05, 3.40e+04, 2.59e+05, 1.64e+05,
4.58e+05, 6.29e+05, 2.80e+04, 2.88e+05, 7.75e+05, 7.85e+05,
6.36e+05, 9.16e+05, 9.94e+05, 3.09e+05, 4.85e+05, 9.14e+05,
9.03e+05, 6.08e+05, 5.00e+05, 5.40e+04, 5.62e+05, 8.47e+05,
9.48e+05, 8.11e+05, 2.70e+05, 4.80e+04, 5.23e+05, 7.84e+05,
2.80e+05, 2.40e+04, 8.92e+05, 1.54e+05, 1.80e+04, 3.30e+04,
8.60e+05, 3.64e+05, 3.87e+05, 6.26e+05, 1.61e+05, 8.79e+05,
3.90e+04, 1.70e+05, 1.41e+05, 1.60e+05, 1.44e+05, 1.43e+05,
1.90e+05, 3.76e+05, 1.93e+05, 4.73e+05, 2.46e+05, 7.30e+04,
2.53e+05, 9.57e+05, 4.20e+05, 7.20e+04, 4.04e+05, 4.70e+05,
2.26e+05, 2.40e+05, 8.90e+04, 2.34e+05, 2.57e+05, 8.61e+05,
4.67e+05, 6.76e+05, 5.52e+05, 5.82e+05, 6.19e+05])
df['Size'].dtype
dtype('float64')
df['Size'].isnull().sum()
1468
df.dropna(subset='Size',inplace=True)
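The cells above build strings such as '19*10**6' and evaluate them with `eval`, which works here but would execute any expression that happened to be in the column. As an alternative sketch, the hypothetical helper `parse_size` below converts the suffix directly and returns `None` for 'Varies with device'.

```python
def parse_size(text):
    """Convert '19M' or '201k' to a number; return None for 'Varies with device'."""
    units = {"M": 10**6, "k": 10**3}
    text = text.strip()
    if text == "Varies with device":
        return None
    suffix = text[-1]
    if suffix not in units:
        raise ValueError(f"unexpected size format: {text!r}")
    return float(text[:-1]) * units[suffix]

sizes = [parse_size(s) for s in ["19M", "201k", "8.5k", "Varies with device"]]
print(sizes)
```

Applied to the column, this could look like `df['Size'].map(parse_size)`, with the `None` entries becoming missing values that `dropna` then removes.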
10. Check the column 'Installs', treat the unwanted characters and convert the column into a suitable data type.
df['Installs'].unique()
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
'50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
'1,000+', '500,000,000+', '100+', '500+', '10+', '1,000,000,000+',
'5+', '50+', '1+'], dtype=object)
df['Installs']=df['Installs'].str.replace(",","",regex=False)
df['Installs']=df['Installs'].str.replace("+","",regex=False)
df['Installs'].unique()
array(['10000', '500000', '5000000', '50000000', '100000', '50000',
'1000000', '10000000', '5000', '100000000', '1000', '500000000',
'100', '500', '10', '1000000000', '5', '50', '1'], dtype=object)
df['Installs']=df['Installs'].astype('int')
df['Installs'].dtype
dtype('int32')
df
| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver | Rating_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 5.075174 | 19000000.0 | 10000 | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up | High |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 6.875232 | 14000000.0 | 500000 | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up | High |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 11.379520 | 8700000.0 | 5000000 | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up | High |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 12.281389 | 25000000.0 | 50000000 | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up | High |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 6.875232 | 2800000.0 | 100000 | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 3.806662 | 619000.0 | 1000 | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up | High |
| 10834 | FR Calculator | FAMILY | 4.0 | 2.079442 | 2600000.0 | 500 | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up | High |
| 10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 3.663562 | 53000000.0 | 5000 | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up | High |
| 10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 1.609438 | 3600000.0 | 100 | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up | High |
| 10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 12.894981 | 19000000.0 | 10000000 | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device | High |
7424 rows × 14 columns
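The clean-up of 'Installs' amounts to removing the thousands separators and the trailing '+' before converting to an integer. A minimal sketch on single strings, using a hypothetical helper `parse_installs`:

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(",", "").rstrip("+"))

installs = [parse_installs(s) for s in ["10,000+", "1+", "1,000,000,000+"]]
print(installs)
```

Note that the result is the lower bound of each bucket ('10,000+' means at least 10,000 installs), so the column is better read as an ordered scale than as an exact count.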
11. Check the column 'Price', remove the unwanted characters and convert the column into a suitable data type.
df['Price'].unique()
array(['0', '$4.99', '$6.99', '$7.99', '$3.99', '$5.99', '$2.99', '$1.99',
'$9.99', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99',
'$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$3.49',
'$10.99', '$7.49', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99',
'$2.49', '$4.49', '$1.70', '$1.49', '$3.88', '$399.99', '$17.99',
'$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$1.59',
'$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99',
'$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59',
'$19.40', '$15.46', '$8.99', '$3.04', '$13.99', '$4.29', '$3.28',
'$4.60', '$1.00', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object)
df['Price']=df['Price'].str.replace("$",'',regex=False)
df['Price']=df['Price'].astype(float)
df['Price'].dtypes
dtype('float64')
df['Price'].unique()
array([ 0. , 4.99, 6.99, 7.99, 3.99, 5.99, 2.99, 1.99,
9.99, 0.99, 9. , 5.49, 10. , 24.99, 11.99, 79.99,
16.99, 14.99, 29.99, 12.99, 3.49, 10.99, 7.49, 1.5 ,
19.99, 15.99, 33.99, 39.99, 2.49, 4.49, 1.7 , 1.49,
3.88, 399.99, 17.99, 400. , 3.02, 1.76, 4.84, 4.77,
1.61, 1.59, 6.49, 1.29, 299.99, 379.99, 37.99, 18.99,
389.99, 8.49, 1.75, 14. , 2. , 3.08, 2.59, 19.4 ,
15.46, 8.99, 3.04, 13.99, 4.29, 3.28, 4.6 , 1. ,
2.9 , 1.97, 2.56, 1.2 ])
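One detail worth knowing when stripping the dollar sign: in a regular expression, `$` is an anchor for the end of the string, not a literal character. The sketch below shows the difference with the standard `re` module and a hypothetical helper `parse_price` that avoids regular expressions altogether.

```python
import re

def parse_price(text):
    """Turn '$4.99' or '0' into a float."""
    return float(text.lstrip("$"))

# As a regex, "$" matches the empty position at the end, so nothing is removed
unescaped = re.sub("$", "", "$4.99")

# Escaping it makes the pattern match the literal dollar sign
escaped = re.sub(r"\$", "", "$4.99")

prices = [parse_price(p) for p in ["0", "$4.99", "$400.00"]]
print(unescaped, escaped, prices)
```

This is why treating the pattern as a literal string is the safer choice when removing the symbol.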
12. Drop the columns which you think are redundant for the analysis. (Suggestion: drop the column 'Rating', since we created a new feature from it (i.e. 'Rating_category'), and the columns 'App', 'Genres', 'Last Updated', 'Current Ver' and 'Android Ver', which are redundant for our analysis.)
df.drop(columns=['Rating','Current Ver','Android Ver','Genres','Last Updated','App'],inplace=True)
df.columns
Index(['Category', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
'Content Rating', 'Rating_category'],
dtype='object')
df
| | Category | Reviews | Size | Installs | Type | Price | Content Rating | Rating_category |
|---|---|---|---|---|---|---|---|---|
| 0 | ART_AND_DESIGN | 5.075174 | 19000000.0 | 10000 | Free | 0.0 | Everyone | High |
| 1 | ART_AND_DESIGN | 6.875232 | 14000000.0 | 500000 | Free | 0.0 | Everyone | High |
| 2 | ART_AND_DESIGN | 11.379520 | 8700000.0 | 5000000 | Free | 0.0 | Everyone | High |
| 3 | ART_AND_DESIGN | 12.281389 | 25000000.0 | 50000000 | Free | 0.0 | Teen | High |
| 4 | ART_AND_DESIGN | 6.875232 | 2800000.0 | 100000 | Free | 0.0 | Everyone | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10833 | BOOKS_AND_REFERENCE | 3.806662 | 619000.0 | 1000 | Free | 0.0 | Everyone | High |
| 10834 | FAMILY | 2.079442 | 2600000.0 | 500 | Free | 0.0 | Everyone | High |
| 10836 | FAMILY | 3.663562 | 53000000.0 | 5000 | Free | 0.0 | Everyone | High |
| 10837 | FAMILY | 1.609438 | 3600000.0 | 100 | Free | 0.0 | Everyone | High |
| 10840 | LIFESTYLE | 12.894981 | 19000000.0 | 10000000 | Free | 0.0 | Everyone | High |
7424 rows × 8 columns
13. Encode the categorical columns.
col=df.select_dtypes(include='object').columns
for i in col:
    print(i," : ")
    print(df[i].unique())
Category : 
['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION']
Type : 
['Free' 'Paid']
Content Rating : 
['Everyone' 'Teen' 'Everyone 10+' 'Mature 17+' 'Adults only 18+' 'Unrated']
Rating_category : 
['High' 'Low']
le_col=['Category','Content Rating','Rating_category']
ohe_col=['Type']
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
le=LabelEncoder()
for i in le_col:
    df[i]=le.fit_transform(df[i])
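Label encoding assigns each distinct label an integer according to its sorted order. Because the loop above refits the same `LabelEncoder` for every column, the fitted object only remembers the labels of the last column afterwards. The pure-Python sketch below mimics the sorted-order mapping and keeps one mapping per column so codes can be decoded later; the helper `fit_label_mapping` is hypothetical and is not part of scikit-learn.

```python
def fit_label_mapping(values):
    """Map each distinct label to its position in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

content_ratings = ["Everyone", "Teen", "Everyone 10+", "Mature 17+",
                   "Adults only 18+", "Unrated"]
rating_categories = ["High", "Low"]

# One mapping per column, kept in a dict so none is overwritten
mappings = {
    "Content Rating": fit_label_mapping(content_ratings),
    "Rating_category": fit_label_mapping(rating_categories),
}

# Inverse mapping for turning codes back into labels
decode_content = {code: label for label, code in mappings["Content Rating"].items()}

print(mappings["Content Rating"])
print(mappings["Rating_category"])
```

Under this ordering 'Everyone' becomes 1, 'Teen' becomes 4 and 'High' becomes 0, which agrees with the encoded values visible in the final tables.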
ohe=OneHotEncoder(sparse=False)  # on scikit-learn 1.2 and later this argument is named sparse_output
ohe_df=pd.DataFrame(ohe.fit_transform(df[ohe_col]))
ohe_df
| | 0 | 1 |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |
| ... | ... | ... |
| 7419 | 1.0 | 0.0 |
| 7420 | 1.0 | 0.0 |
| 7421 | 1.0 | 0.0 |
| 7422 | 1.0 | 0.0 |
| 7423 | 1.0 | 0.0 |
7424 rows × 2 columns
a=ohe.categories_
a
[array(['Free', 'Paid'], dtype=object)]
ohe_df.columns=list(a[0])
ohe_df
| | Free | Paid |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |
| ... | ... | ... |
| 7419 | 1.0 | 0.0 |
| 7420 | 1.0 | 0.0 |
| 7421 | 1.0 | 0.0 |
| 7422 | 1.0 | 0.0 |
| 7423 | 1.0 | 0.0 |
7424 rows × 2 columns
df.reset_index(drop=True,inplace=True)
df_final=pd.concat((df,ohe_df),axis='columns',ignore_index=True)
df_final
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5.075174 | 19000000.0 | 10000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 1 | 0 | 6.875232 | 14000000.0 | 500000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 2 | 0 | 11.379520 | 8700000.0 | 5000000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 3 | 0 | 12.281389 | 25000000.0 | 50000000 | Free | 0.0 | 4 | 0 | 1.0 | 0.0 |
| 4 | 0 | 6.875232 | 2800000.0 | 100000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7419 | 3 | 3.806662 | 619000.0 | 1000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7420 | 11 | 2.079442 | 2600000.0 | 500 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7421 | 11 | 3.663562 | 53000000.0 | 5000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7422 | 11 | 1.609438 | 3600000.0 | 100 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7423 | 18 | 12.894981 | 19000000.0 | 10000000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
7424 rows × 10 columns
df_final.columns=list(df.columns)+list(ohe_df.columns)
df_final
| | Category | Reviews | Size | Installs | Type | Price | Content Rating | Rating_category | Free | Paid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5.075174 | 19000000.0 | 10000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 1 | 0 | 6.875232 | 14000000.0 | 500000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 2 | 0 | 11.379520 | 8700000.0 | 5000000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 3 | 0 | 12.281389 | 25000000.0 | 50000000 | Free | 0.0 | 4 | 0 | 1.0 | 0.0 |
| 4 | 0 | 6.875232 | 2800000.0 | 100000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7419 | 3 | 3.806662 | 619000.0 | 1000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7420 | 11 | 2.079442 | 2600000.0 | 500 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7421 | 11 | 3.663562 | 53000000.0 | 5000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7422 | 11 | 1.609438 | 3600000.0 | 100 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
| 7423 | 18 | 12.894981 | 19000000.0 | 10000000 | Free | 0.0 | 1 | 0 | 1.0 | 0.0 |
7424 rows × 10 columns
df_final.drop(columns=ohe_col,inplace=True)
df_final.columns
Index(['Category', 'Reviews', 'Size', 'Installs', 'Price', 'Content Rating',
'Rating_category', 'Free', 'Paid'],
dtype='object')
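The concatenate, rename, and drop steps above can also be collapsed into a single call. The following is a minimal sketch of that alternative using `pd.get_dummies`, not the approach taken in this notebook; the `toy` frame and its values are made up for illustration, and the resulting columns are named `Type_Free` and `Type_Paid` rather than `Free` and `Paid`.

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are invented for illustration).
toy = pd.DataFrame({
    "Category": ["GAME", "FAMILY", "GAME", "TOOLS"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Price": [0.0, 2.99, 0.0, 0.0],
})

# get_dummies removes the original 'Type' column and appends one
# indicator column per level, keeping readable column names.
encoded = pd.get_dummies(toy, columns=["Type"], dtype=float)

print(encoded.columns.tolist())
# → ['Category', 'Price', 'Type_Free', 'Type_Paid']
```

Because the index is never discarded, this variant needs no `reset_index` or manual column renaming. One trade-off: unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the levels, so it is less convenient when new data must be encoded later.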
14. Segregate the target and independent features (Hint: Use Rating_category as the target)
y=df['Rating_category']
X=df_final.drop(columns='Rating_category')
X
| | Category | Reviews | Size | Installs | Price | Content Rating | Free | Paid |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 5.075174 | 19000000.0 | 10000 | 0.0 | 1 | 1.0 | 0.0 |
| 1 | 0 | 6.875232 | 14000000.0 | 500000 | 0.0 | 1 | 1.0 | 0.0 |
| 2 | 0 | 11.379520 | 8700000.0 | 5000000 | 0.0 | 1 | 1.0 | 0.0 |
| 3 | 0 | 12.281389 | 25000000.0 | 50000000 | 0.0 | 4 | 1.0 | 0.0 |
| 4 | 0 | 6.875232 | 2800000.0 | 100000 | 0.0 | 1 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7419 | 3 | 3.806662 | 619000.0 | 1000 | 0.0 | 1 | 1.0 | 0.0 |
| 7420 | 11 | 2.079442 | 2600000.0 | 500 | 0.0 | 1 | 1.0 | 0.0 |
| 7421 | 11 | 3.663562 | 53000000.0 | 5000 | 0.0 | 1 | 1.0 | 0.0 |
| 7422 | 11 | 1.609438 | 3600000.0 | 100 | 0.0 | 1 | 1.0 | 0.0 |
| 7423 | 18 | 12.894981 | 19000000.0 | 10000000 | 0.0 | 1 | 1.0 | 0.0 |
7424 rows × 8 columns
y
0 0
1 0
2 0
3 0
4 0
..
7419 0
7420 0
7421 0
7422 0
7423 0
Name: Rating_category, Length: 7424, dtype: int32
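The target is now an integer, so it helps to confirm which integer means which label. `LabelEncoder` sorts the class labels, which puts 'High' at 0 and 'Low' at 1. The sketch below shows one way to check this and to look at the class shares; the `labels` series is made up, and in the notebook the single reused `le` only holds the classes of the last column it was fitted on (`Rating_category`).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up labels standing in for the original Rating_category strings.
labels = pd.Series(["High", "Low", "High", "High"], name="Rating_category")

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ is sorted, so a label's position in the array is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
# → {'High': 0, 'Low': 1}

# Share of each encoded class; useful for spotting imbalance before modelling.
shares = pd.Series(codes).value_counts(normalize=True).sort_index()
print(shares)
```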
15. Split the dataset into train and test.
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=555)
X_train
| | Category | Reviews | Size | Installs | Price | Content Rating | Free | Paid |
|---|---|---|---|---|---|---|---|---|
| 5331 | 12 | 8.364275 | 24000000.0 | 50000 | 0.0 | 1 | 1.0 | 0.0 |
| 7378 | 25 | 2.890372 | 2400000.0 | 1000 | 0.0 | 1 | 1.0 | 0.0 |
| 7369 | 19 | 5.814131 | 676000.0 | 10000 | 0.0 | 1 | 1.0 | 0.0 |
| 3663 | 26 | 9.915811 | 47000000.0 | 5000000 | 0.0 | 1 | 1.0 | 0.0 |
| 123 | 3 | 11.412763 | 5900000.0 | 5000000 | 0.0 | 1 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2628 | 6 | 9.644717 | 2200000.0 | 1000000 | 0.0 | 1 | 1.0 | 0.0 |
| 1057 | 14 | 10.999680 | 7800000.0 | 5000000 | 0.0 | 1 | 1.0 | 0.0 |
| 7145 | 14 | 12.677123 | 27000000.0 | 50000000 | 0.0 | 4 | 1.0 | 0.0 |
| 4782 | 27 | 1.791759 | 1800000.0 | 5 | 0.0 | 1 | 1.0 | 0.0 |
| 6554 | 14 | 13.174189 | 49000000.0 | 10000000 | 0.0 | 2 | 1.0 | 0.0 |
5568 rows × 8 columns
X_test
| | Category | Reviews | Size | Installs | Price | Content Rating | Free | Paid |
|---|---|---|---|---|---|---|---|---|
| 4515 | 11 | 7.547502 | 72000000.0 | 50000 | 0.0 | 1 | 1.0 | 0.0 |
| 7003 | 4 | 3.332205 | 18000000.0 | 5000 | 0.0 | 1 | 1.0 | 0.0 |
| 1767 | 26 | 10.477232 | 16000000.0 | 1000000 | 0.0 | 1 | 1.0 | 0.0 |
| 2169 | 25 | 8.701513 | 4300000.0 | 500000 | 0.0 | 1 | 1.0 | 0.0 |
| 3420 | 11 | 7.355641 | 35000000.0 | 50000 | 0.0 | 1 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4666 | 11 | 6.242223 | 9800000.0 | 50000 | 0.0 | 1 | 1.0 | 0.0 |
| 3662 | 16 | 4.779123 | 72000000.0 | 50000 | 0.0 | 1 | 1.0 | 0.0 |
| 6416 | 11 | 8.187021 | 19000000.0 | 500000 | 0.0 | 1 | 1.0 | 0.0 |
| 3232 | 25 | 8.538563 | 1400000.0 | 500000 | 0.0 | 1 | 1.0 | 0.0 |
| 5169 | 21 | 2.079442 | 3400000.0 | 1000 | 0.0 | 2 | 1.0 | 0.0 |
1856 rows × 8 columns
y_train
5331 0
7378 0
7369 0
3663 0
123 0
..
2628 0
1057 0
7145 0
4782 0
6554 0
Name: Rating_category, Length: 5568, dtype: int32
y_test
4515 0
7003 0
1767 0
2169 0
3420 0
..
4666 0
3662 0
6416 0
3232 0
5169 0
Name: Rating_category, Length: 1856, dtype: int32
X_train.shape,X_test.shape,y_train.shape,y_test.shape
((5568, 8), (1856, 8), (5568,), (1856,))
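If the two rating classes turn out to be imbalanced, a plain random split can leave the test set with a different class mix than the training set. A hedged sketch of a stratified split follows; it uses synthetic arrays (`X_toy`, `y_toy`, with an invented 90/10 imbalance) rather than the notebook's data, and reuses the same `test_size` and `random_state` as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X and y; the 90/10 class imbalance is invented.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([0] * 180 + [1] * 20)

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=555, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
# → 0.1 0.1
```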
16. Standardize the data so that each feature has zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit_transform(X)  # StandardScaler ignores y; this fits on all rows of X and only displays the result
array([[-2.03766618, -0.69434673, -0.15992777, ..., -0.46322046,
0.28202925, -0.28202925],
[-2.03766618, -0.20638643, -0.37330014, ..., -0.46322046,
0.28202925, -0.28202925],
[-2.03766618, 1.01463714, -0.59947486, ..., -0.46322046,
0.28202925, -0.28202925],
...,
[-0.68621673, -1.07700695, 1.29100439, ..., -0.46322046,
0.28202925, -0.28202925],
[-0.68621673, -1.63383939, -0.81711468, ..., -0.46322046,
0.28202925, -0.28202925],
[ 0.17379656, 1.42544875, -0.15992777, ..., -0.46322046,
0.28202925, -0.28202925]])
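The call above fits the scaler on every row of `X`, test rows included, and the scaled array is not stored. A common alternative is to fit on the training part only and reuse those statistics for the test part, so that no information from the test set reaches the model. The sketch below illustrates this on small made-up arrays (`X_train_toy`, `X_test_toy`); in the notebook the same pattern would apply to `X_train` and `X_test`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up arrays standing in for X_train and X_test.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train_toy)

# Apply the training statistics to the test rows; do not refit here.
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_)      # per-feature means learned from the training rows
print(X_test_scaled)
```

The scaled test values need not have zero mean or unit variance, which is expected: they are expressed relative to the training distribution.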